Natural language processing keyword analysis

ABSTRACT

As disclosed herein, a method for generating a natural language processing query includes receiving one or more documents, wherein each document comprises a set of words, processing the one or more documents and the sets of words to provide a document content matrix V, a word feature matrix W, and a document feature matrix H, forecasting values for each entry of the word feature matrix and the document feature matrix over a selected time interval and a selected set of domains to provide a forecasted word feature matrix W′ and a forecasted document feature matrix H′, calculating a set of coefficients for forecasted document feature matrix H′ such that V=W′*H′, determining a rank for each word of the sets of words according to the calculated set of coefficients, and generating one or more queries according to the determined ranks for each word of the set of words.

BACKGROUND OF THE INVENTION

The present invention relates generally to the field of natural language processing, and more specifically to generating natural language processing queries.

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. In particular, NLP is concerned with programming computers to process large natural language corpora. Natural language processing systems are being deployed with hundreds of simultaneous users. Users can reformulate questions based on knowledge returned from an NLP system.

SUMMARY

As disclosed herein, a method for generating a natural language processing query includes receiving one or more documents, wherein each document comprises a set of words, processing the one or more documents and the sets of words to provide a document content matrix V corresponding to the one or more documents and the sets of words, a word feature matrix W corresponding to a set of selected features and the sets of words, and a document feature matrix H corresponding to the one or more documents and the set of selected features, forecasting values for each entry of the word feature matrix and the document feature matrix over a selected time interval and a selected set of domains to provide a forecasted word feature matrix W′ and a forecasted document feature matrix H′, calculating a set of coefficients for forecasted document feature matrix H′ such that V=W′*H′, determining a rank for each word of the sets of words according to the calculated set of coefficients, and generating one or more queries according to the determined ranks for each word of the set of words. A computer program product and a computer system corresponding to the method are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a functional block diagram depicting a natural language processing system in accordance with some embodiments of the present invention;

FIG. 2 is a flowchart depicting an NLP query creation method in accordance with at least one embodiment of the present invention;

FIG. 3 is a flowchart depicting a word rank determination method 300 in accordance with at least one embodiment of the present invention;

FIG. 4 depicts a plurality of matrices in accordance with one embodiment of the present invention; and

FIG. 5 depicts a block diagram of components of a computer, in accordance with some embodiments of the present invention.

DETAILED DESCRIPTION

Culture-specific semantics such as “slang” terms introduce an additional complication to natural language processing. With more and more words carrying both a so-called “dictionary” definition and a completely separate “slang” definition, it becomes more important to utilize contextual information to train natural language processing systems to perform parallel searches according to these separate definitions.

The present invention will now be described in detail with reference to the Figures. Implementation of embodiments of the invention may take a variety of forms, and exemplary implementation details are discussed subsequently with reference to the Figures.

FIG. 1 is a functional block diagram depicting a natural language processing system 100 in accordance with some embodiments of the present invention. As depicted, natural language processing system 100 includes two computing systems 110 (i.e., 110A and 110B), a query generation application 112, a natural language processing database 116, and a network 130. It should be noted that, while FIG. 1 depicts separate computing systems 110 hosting query generation application 112 and natural language processing database 116, in another embodiment both may be hosted on the same computing system. Natural language processing system 100 enables NLP query generation that leverages parallel searches based on multiple definitions of a keyword.

Computing systems 110 can be desktop computers, laptop computers, specialized computer servers, or any other computer systems known in the art. In some embodiments, computing systems 110 represent computer systems utilizing clustered computers and components to act as a single pool of seamless resources. In general, computing systems 110 are representative of any electronic devices, or combinations of electronic devices, capable of executing machine-readable program instructions, as described in greater detail with regard to FIG. 5.

As depicted, computing system 110A comprises a query generation application 112. Query generation application 112 may be configured to execute an NLP query creation method to generate NLP queries according to a plurality of entries. In at least one embodiment, query generation application 112 is configured to receive a set of one or more documents. Query generation application 112 may be further configured to process and analyze the received documents. One embodiment of an appropriate NLP query creation method is described with respect to FIG. 2. Query generation application 112 may also be configured to execute a word rank determination method, such as the one described with respect to FIG. 3.

As depicted, computing system 110B comprises an NLP database 116. NLP database 116 may be a data store in which a plurality of data items relating to natural language processing tasks are stored. For example, NLP database 116 may store dictionary information for multiple languages, semantic information for the appropriate languages, user information for users within a relevant network, search histories, search preferences, etc. While NLP database 116 is depicted as being hosted on a separate computing system from the query generation application 112, it should be appreciated that in other embodiments, query generation application 112 and NLP database 116 may coexist on the same system.

Network 130 can be, for example, a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination of the two, and include wired, wireless, or fiber optic connections. In general, network 130 can be any combination of connections and protocols that will support communications between computing system 110A and 110B in accordance with an embodiment of the present invention. In at least one embodiment of the present invention, network 130 transmits NLP queries between computing system 110A and computing system 110B.

FIG. 2 is a flowchart depicting an NLP query creation method 200 in accordance with at least one embodiment of the present invention. As depicted, NLP query creation method 200 includes receiving (210) one or more documents each comprising a set of words, processing (220) the one or more documents to provide a document content matrix “V”, identifying (230) features corresponding to each word from the sets of words, creating (240) a word feature matrix “W” corresponding to the identified features from the plurality of words, creating (250) a document feature matrix “H” corresponding to the documents and a set of coefficients for the identified features, processing (260) matrices “V”, “W”, and “H” to provide a determined rank for each word from the sets of words, and generating (270) one or more questions according to the determined rank for each word. NLP query creation method 200 enables query generation according to parallel searches based on multiple definitions of a keyword.

Receiving (210) one or more documents each comprising a set of words may include receiving a user-initiated keyword search. In at least one embodiment, query generation application 112 identifies historical searches by the user, semantic meanings and surface level contours associated with the keyword search, and searches conducted by other users that are related to the keyword search. Query generation application 112 analyzes the results of each of these identified searches and accompanying metrics to identify one or more documents to be analyzed, wherein each of the one or more documents comprises a set of words. In other words, each time a search is executed, the search terms and a list of documents identified by the search are received. In some embodiments, query generation application 112 sets a maximum on the number of documents that will be received, or that will be analyzed. In cases where the received documents include pictures or any content other than text, a configurable setting is available to indicate whether or not to process text within images or videos, where applicable.

Processing (220) the one or more documents to provide a document content matrix “V” may include identifying each word within each document. Query generation application 112 creates a column in matrix “V” corresponding to each document, and creates a row in matrix “V” corresponding to each word within a document. In one embodiment, each time query generation application 112 encounters a word within a document, query generation application 112 determines whether or not a row in matrix “V” exists corresponding to said word. If a row does already exist, query generation application 112 updates an entry in the column corresponding to the document that indicates the word is present in said document. In at least one embodiment, query generation application 112 updates the entry in the appropriate column to indicate how many times the word appears in the document. If a row does not exist, query generation application 112 creates a row corresponding to the word prior to creating an entry in the appropriate column. FIG. 4A depicts one example of a document content matrix 410 wherein each row corresponds to a word, and each column corresponds to a document.
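By way of non-limiting illustration, the row-creation and entry-update logic described above may be sketched as follows. The function name build_document_content_matrix, the lowercase regular-expression tokenizer, and the sample documents are assumptions made for this sketch and are not part of the disclosed method; entries hold occurrence counts, as in the embodiment that records how many times a word appears.

```python
import re


def build_document_content_matrix(documents):
    """Return (words, V) where V[i][j] counts occurrences of words[i] in documents[j]."""
    row_of = {}  # word -> row index in V
    words = []
    V = []
    for j, text in enumerate(documents):
        # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            if word not in row_of:
                # No row exists for this word yet: create it before writing the entry.
                row_of[word] = len(words)
                words.append(word)
                V.append([0] * len(documents))
            # Update the entry in the column for the current document.
            V[row_of[word]][j] += 1
    return words, V


docs = ["turn the screw right", "use the right drill bit", "drill the hole"]
words, V = build_document_content_matrix(docs)
```

In this sketch a word that is absent from a document simply keeps a zero entry in that document's column.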

In at least one embodiment of the present invention, matrix “V” includes separate rows for unique contextual appearances of a word. For example, one document may contain the following phrases: “Rotate the screw 270 degrees so the arrow is pointing to the right of the board”, and “Before drilling the hole, double check to make sure you are using the right drill bit.” The word “right” appears in both of these phrases, but has a distinctly different meaning in each phrase. The context in which each instance of the word appears is indicative of which definition of the word is appropriate. With respect to this example, query generation application 112 creates separate entries for these two unique uses of the word “right”. In some embodiments, query generation application 112 creates a separate matrix row for each contextually unique appearance of a word, as determined by the syntax of the phrase or sentence surrounding the word. In other embodiments, query generation application 112 identifies an appropriate definition for each instance of the word, and creates a unique matrix entry only if a row corresponding to the appropriate definition of the word does not exist already.

Identifying (230) features corresponding to each word from the sets of words may include analyzing each word with respect to one or more selected features. In some embodiments, query generation application 112 is configured to either receive or determine a set of features corresponding to each word. As used herein, a word “feature” refers to a characteristic of the word, either calculated or observable, many of which correspond directly to how frequently a word is used in a particular context. Example features may be a word's Indri rank, Lucene rank, language model rank, prismatic cut count, platform rank (wherein a platform rank is a word's rank on a particular platform, such as a social media site or other website), user rank for any number of users, user language rank for any number of users, and a word disagreement metric for any number of users.

In some embodiments, identifying (230) features corresponding to each word includes receiving feature information that indicates one or more features with respect to which each document is to be analyzed. For example, the feature information may indicate that the selected features are platform rank, user rank for User A, user rank for User B, and Lucene rank. Query generation application 112 may either receive these feature values from another source, or may be configured to calculate each of these features for each word. For example, with respect to the Lucene rank feature, query generation application 112 may receive the Lucene rank (or Lucene score) from another application that has already calculated it. If no such application or source exists to provide the Lucene rank, query generation application 112 is configured to calculate the Lucene rank for each word according to Lucene's practical scoring function. The same concept is applied to each selected feature. One or more feature combinations may be identified using either the Newton-Raphson method, stochastic gradient descent (SGD), or any other iterative method known in the art. In at least one embodiment, identifying (230) features corresponding to each word further includes using one of these methods to weight multiple features into a single combined feature.

The Newton-Raphson method is an iterative method that adjusts feature weights. Each iteration of the method provides a result x_{n+1} that is based on a result from a previous iteration x_n. The method approximates roots of a function F by making an initial estimate x_0 for a root of the function F and identifies an improved estimate x_1 according to the equation:

x_{n+1} = x_n - F(x_n) / F'(x_n)   (1)

With respect to equation (1), F'(x_n) indicates the derivative of the function F at point x_n. Equation (1) may be used iteratively to continually improve the estimate until the changes in the calculated root x_{n+1} become minimal upon each iteration, indicating that the approximate root has been identified. The results of the Newton-Raphson method may identify a set of best feature combinations.
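By way of non-limiting illustration, the iteration of equation (1) may be sketched as follows. The function name newton_raphson, the tolerance tol, the iteration cap max_iter, and the example function F(x) = x² − 2 are assumptions chosen for this sketch; the disclosure does not specify a stopping threshold or a particular function F.

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Approximate a root of f by iterating x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        slope = f_prime(x)
        if slope == 0:
            raise ZeroDivisionError("derivative vanished; choose another initial estimate")
        x_next = x - f(x) / slope
        # Stop once successive estimates change only minimally.
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x


# Example: the positive root of F(x) = x^2 - 2, starting from x_0 = 1.
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

The stopping test mirrors the criterion stated above: iteration ends when the change between successive estimates falls below the assumed tolerance.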

Stochastic gradient descent is a method for identifying a parameter that minimizes the following equation (2).

Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w)   (2)

With respect to equation (2), Q is an objective function of the variable w, n corresponds to the number of data points in a dataset, and Q_i corresponds to the i-th observation in the dataset. The parameter w_min that minimizes equation (2) is identified according to the following equation:

w_n = w_{n-1} - \eta \frac{1}{n} \sum_{i=1}^{n} \nabla Q_i(w_{n-1})   (3)

With respect to equation (3), w_{n-1} corresponds to a previously calculated w_n value and η is a step size (learning rate). The subscript on w is the iteration index, whereas the n in the factor 1/n and in the summation limit is the number of data points, as in equation (2). Once repeated iterations of equation (3) begin to yield minimal delta values between w_n and w_{n-1}, an approximation of the parameter w that minimizes equation (2) has been reached.
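By way of non-limiting illustration, the update of equation (3) may be sketched as follows. As written, equation (3) averages the gradient over all n observations at each step; a stochastic variant would instead sample one observation or a small batch per step. The function name minimize_mean_objective, the step size eta, the tolerance, and the squared-error objective Q_i(w) = (w − y_i)² are assumptions made for this sketch.

```python
def minimize_mean_objective(grad_i, data, w0, eta=0.1, tol=1e-9, max_iter=10_000):
    """Iterate w <- w - eta * (1/n) * sum_i grad Q_i(w) until the update is minimal."""
    w = w0
    n = len(data)
    for _ in range(max_iter):
        # Average of the per-observation gradients, as in equation (3).
        avg_grad = sum(grad_i(w, point) for point in data) / n
        w_next = w - eta * avg_grad
        if abs(w_next - w) < tol:
            return w_next
        w = w_next
    return w


# Assumed objective: Q_i(w) = (w - y_i)^2, whose gradient is 2 * (w - y_i).
data = [1.0, 2.0, 6.0]
w_min = minimize_mean_objective(lambda w, y: 2 * (w - y), data, w0=0.0)
```

For this assumed objective the minimizer of equation (2) is the mean of the observations, which gives a simple check on the result.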

Creating (240) a word feature matrix “W” corresponding to the identified features from the plurality of words may include generating a matrix “W” that indicates values for each feature displayed by each word. In one embodiment, each row corresponds to a word identified in one of the received documents. In the same embodiment, each column corresponds to a feature to be analyzed for each of the words. Each entry in matrix “W” in said embodiment includes a numerical value indicating how prominently the word indicated by the row displays the feature indicated by the column. For some features or feature combinations, the numerical values are identified according to the methods discussed with respect to step 230. Other features may have easily discernible numerical values; for example, a word's platform rank. An example word feature matrix 420 is depicted with respect to FIG. 4, wherein each row corresponds to a word and each column corresponds to a feature of the word(s).

Creating (250) a document feature matrix “H” corresponding to the documents and a set of coefficients for the identified features may include generating a matrix “H” that indicates values for each feature displayed across each document. It should be noted that the features included in the document feature matrix “H” are the same as the features included in the word feature matrix “W.” In one embodiment, each row corresponds to a feature analyzed for the words within the documents. In the same embodiment, each column corresponds to one of the received documents. Query generation application 112 may calculate an aggregate feature score for each document by calculating the sum of the feature scores from the word feature matrix “W” for each word in the document. In another embodiment, query generation application 112 determines an aggregate feature score by taking the average of the feature score for each word in said document. The calculated aggregate feature score for the document is included in the document feature matrix “H” in the appropriate row-column entry. An example document feature matrix 430 is depicted with respect to FIG. 4, wherein each row corresponds to a feature and each column corresponds to a document.
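By way of non-limiting illustration, the sum and average embodiments of the aggregate feature score may be sketched as follows. The function name build_document_feature_matrix and the sample values are assumptions made for this sketch. The sketch further assumes that each distinct word present in a document contributes its feature score once, regardless of how many times it occurs; the disclosure does not state whether repeated occurrences are weighted.

```python
def build_document_feature_matrix(V, W, average=False):
    """Return H with H[k][j] = sum (or mean) of W[i][k] over words i present in document j."""
    n_words, n_docs = len(V), len(V[0])
    n_features = len(W[0])
    H = [[0.0] * n_docs for _ in range(n_features)]
    for j in range(n_docs):
        # Rows of V with a nonzero entry in column j are the words present in document j.
        present = [i for i in range(n_words) if V[i][j] > 0]
        for k in range(n_features):
            total = sum(W[i][k] for i in present)
            H[k][j] = total / len(present) if average and present else total
    return H


# Hypothetical data: 3 words, 2 documents, 2 features.
V = [[1, 0], [1, 1], [0, 2]]
W = [[0.5, 1.0], [0.2, 3.0], [0.3, 2.0]]
H_sum = build_document_feature_matrix(V, W)
H_avg = build_document_feature_matrix(V, W, average=True)
```

The resulting matrix has one row per feature and one column per document, matching the layout described for document feature matrix 430.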

Processing (260) matrices “V”, “W”, and “H” to provide a determined rank for each word from the sets of words may include performing matrix operations to calculate weight coefficients within document feature matrix “H”. In one embodiment, query generation application 112 executes a word rank determination method to provide a ranked set of words. The ranked set of words includes the sets of words from each of the documents, wherein the words are ranked according to one or more calculated document feature scores. Details of one example of an appropriate word rank determination method are discussed with respect to FIG. 3.

Generating (270) one or more questions according to the determined rank for each word may include using the top ranked words to generate search phrases or questions corresponding to the original query. The questions may be generated using a query template. A query template as used herein refers to a template from which queries will be generated using the determined word ranks. In one embodiment, a query template may include slots for a subject, a verb, and an object. Such a template will be filled out by taking the highest ranked subject, the highest ranked verb, and the highest ranked object, and plugging them into the template. A template may include filler terms around the word slots. Each word may be used in a plurality of generated queries. For example, the highest ranked subject may be used in conjunction with the highest ranked verb and the highest ranked object in one query, and with the second-highest ranked verb and the second-highest ranked object in another generated query.
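By way of non-limiting illustration, filling a subject-verb-object query template from ranked words may be sketched as follows. The function name generate_queries, the template wording, and the sample ranked words are hypothetical; the sketch follows the example above, in which the highest ranked subject is paired with the r-th ranked verb and the r-th ranked object in the r-th generated query.

```python
def generate_queries(template, ranked_words, n_queries=2):
    """Fill the template once per rank level; ranked_words lists words best first per slot."""
    subject = ranked_words["subject"][0]  # the highest ranked subject is reused
    queries = []
    for r in range(n_queries):
        # Fall back to the last available word if a slot has fewer than n_queries words.
        verb = ranked_words["verb"][min(r, len(ranked_words["verb"]) - 1)]
        obj = ranked_words["object"][min(r, len(ranked_words["object"]) - 1)]
        queries.append(template.format(subject=subject, verb=verb, object=obj))
    return queries


# Hypothetical template with filler terms around the word slots.
template = "How does the {subject} {verb} the {object}?"
ranked = {
    "subject": ["drill", "screw"],
    "verb": ["cut", "rotate"],
    "object": ["board", "hole"],
}
queries = generate_queries(template, ranked)
```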

FIG. 3 is a flowchart depicting a word rank determination method 300 in accordance with at least one embodiment of the present invention. As depicted, word rank determination method 300 includes receiving (310) a document feature matrix “H”, a word feature matrix “W”, and a document content matrix “V”, applying (320) a domain tensor and a time tensor to document feature matrix “H” and word feature matrix “W”, forecasting (330) document feature matrix “H” and word feature matrix “W” according to the domain and time tensors, determining (340) coefficient values for word feature matrix “W” such that V=WH, and determining (350) a rank for each word according to the determined coefficient values for each feature. Word rank determination method 300 enables a plurality of words to be ranked according to a set of weighted features.

Receiving (310) a document feature matrix “H”, a word feature matrix “W”, and a document content matrix “V” may include query generation application 112 receiving a document content matrix wherein each row corresponds to a word from within a document, and each column corresponds to a document. In other words, an entry V_(ij), wherein “i” indicates the row and “j” indicates the column, indicates whether the word corresponding to row “i” exists in the document indicated by column “j.” A word feature matrix “W” may be a matrix wherein each row corresponds to a word and each column corresponds to a feature about the words. In other words, an entry W_(ij), wherein “i” indicates the row and “j” indicates the column, corresponds to a value indicating how prominently the word indicated by row “i” displays the feature indicated by column “j.” A document feature matrix “H” may be a matrix wherein each row corresponds to a feature and each column corresponds to a document. In other words, an entry H_(ij), wherein “i” indicates the row and “j” indicates the column, corresponds to a value indicating how prominently the feature indicated by row “i” is displayed in the document indicated by column “j.” The matrices “H”, “W”, and “V” are the same matrices as discussed with respect to FIG. 2. In some embodiments, query generation application 112 identifies the matrices in a database. In other embodiments, query generation application 112 creates the matrices as described with respect to FIG. 2.

Applying (320) a domain tensor and a time tensor to document feature matrix “H” and word feature matrix “W” may include adding two more dimensions to the matrices “H” and “W”. A domain as used herein is defined as a specific subject area (such as animals, books, trucks, locations, etc.) that is used to build language models. Applying (320) a domain tensor may include adding a third degree to the matrices “H” and “W” according to a selected number of domains. For example, if matrix “H” has 10 features (rows) and 500 documents (columns), and 20 domains are selected, then matrix “H” becomes a 10×500×20 tensor. Similarly, if matrix “W” has 10,000 words (rows) and 10 features (columns), and 20 domains are selected, then matrix “W” becomes a 10,000×10×20 tensor.

Applying (320) a domain tensor and a time tensor may further include adding a fourth degree to the third degree tensors “H” and “W” according to a selected period of time. Applying (320) a time tensor may further include specifying a unit of time. Continuing the examples used previously, if “H” has 10 features, 500 documents, and 20 selected domains, and a time period of 30 minutes is selected, then “H” becomes a 10×500×20×30 tensor. Similarly, if “W” has 10,000 words, 10 features, and 20 selected domains, and a time period of 30 minutes is selected, then “W” becomes a 10,000×10×20×30 tensor.

Forecasting (330) weight tensor “H” and feature tensor “W” according to the domain and time tensors may include projecting values for a selected point “T” in the selected period of time. For example, continuing with the example used previously with a selected time period of 30 minutes, the selected point “T” may be the end of that interval (T=30). In that case, each value is projected using any number of predictive modeling methods known in the art. Predictive modeling uses statistics to predict outcomes, in many cases on the basis of detection theory. Forecasting the weight tensor “H” and feature tensor “W” with respect to the ongoing example provides a forecasted (10×500×20) weight tensor “H” and a forecasted (10,000×10×20) feature tensor “W.”
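By way of non-limiting illustration, collapsing the time dimension of a fourth degree tensor into a forecasted third degree tensor may be sketched as follows. The disclosure leaves the predictive modeling method open; a least-squares linear trend is used here purely as an assumed stand-in. The function names forecast_series and forecast_tensor, the nested-list tensor representation, and the assumption that observations are available at time steps 1 through m and projected to a later point T are likewise choices made for this sketch.

```python
def forecast_series(values, t_target):
    """Fit a least-squares line to values observed at t = 1..len(values); evaluate at t_target."""
    n = len(values)
    if n == 1:
        return float(values[0])
    ts = range(1, n + 1)
    t_mean = sum(ts) / n
    v_mean = sum(values) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, values))
             / sum((t - t_mean) ** 2 for t in ts))
    return v_mean + slope * (t_target - t_mean)


def forecast_tensor(tensor, t_target):
    """Replace the innermost (time) axis of a nested-list tensor with one forecasted value."""
    if not isinstance(tensor[0], list):
        return forecast_series(tensor, t_target)
    return [forecast_tensor(sub, t_target) for sub in tensor]


# Hypothetical "H" tensor: 2 features x 2 documents x 1 domain x 3 observed time steps.
H4 = [
    [[[1.0, 2.0, 3.0]], [[2.0, 2.0, 2.0]]],
    [[[0.0, 1.0, 2.0]], [[4.0, 3.0, 2.0]]],
]
H3 = forecast_tensor(H4, t_target=5)  # forecasted 2 x 2 x 1 tensor
```

Applied to the ongoing example, the same reduction would turn a 10×500×20×30 tensor into a 10×500×20 tensor.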

Determining (340) a set of coefficient values for forecasted word feature matrix “W” may include executing matrix multiplication such that V=WH, wherein each entry of forecasted word feature matrix “W” includes a variable corresponding to a selected weight for the feature associated with said entry. In at least one embodiment, each entry in a shared column of forecasted word feature matrix “W” includes the same variable, as each entry in a shared column corresponds to the same feature. Determining (340) a set of coefficient values for forecasted word feature matrix “W” may further include scaling the determined coefficients to fall within a selected range.
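By way of non-limiting illustration, one way to determine a single coefficient per feature column may be sketched as follows. Because V=WH generally cannot be satisfied exactly once W and H are fixed, the sketch assumes a least-squares fit: it seeks coefficients c_k that minimize the squared difference between V_ij and the sum over k of c_k·W_ik·H_kj, and solves the resulting normal equations. The least-squares formulation, the function names, and the sample matrices are assumptions for this sketch and are not stated in the disclosure.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system; the features are not independent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_feature_coefficients(V, W, H):
    """Least-squares c such that V[i][j] ~ sum_k c[k] * W[i][k] * H[k][j]."""
    n_words, n_docs, n_feat = len(V), len(V[0]), len(H)
    # Normal equations: A[k][l] = (sum_i W_ik W_il) * (sum_j H_kj H_lj).
    A = [[sum(W[i][k] * W[i][l] for i in range(n_words))
          * sum(H[k][j] * H[l][j] for j in range(n_docs))
          for l in range(n_feat)] for k in range(n_feat)]
    b = [sum(V[i][j] * W[i][k] * H[k][j]
             for i in range(n_words) for j in range(n_docs))
         for k in range(n_feat)]
    return solve_linear_system(A, b)


# Hypothetical data constructed so that the exact coefficients are 2.0 and 0.5.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = [[1.0, 2.0], [3.0, 1.0]]
V = [[2.0, 4.0], [1.5, 0.5], [3.5, 4.5]]
coefficients = fit_feature_coefficients(V, W, H)
```

Using one coefficient per column reflects the embodiment in which every entry of a shared column of “W” carries the same variable; any subsequent scaling to a selected range is omitted from the sketch.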

Determining (350) a rank for each word according to the determined coefficient values for each feature may include calculating an aggregate weight for each word according to the document feature matrix. In one embodiment, query generation application 112 calculates an average across each row in the document feature matrix to provide an aggregate weight for the corresponding words. In another embodiment, query generation application 112 calculates a summation of all entries within a row to provide an aggregate weight for a corresponding word. Determining (350) a rank for each word may further include ordering or sorting the words according to their determined aggregate weights. Query generation application 112 may assign a rank to each word according to its position in that order. In at least one embodiment, query generation application 112 assigns ranks 1 through n, where n is the total number of words, and a smaller rank (i.e., a rank closer to 1) indicates a higher weight. In other embodiments, query generation application 112 assigns ranks according to a distribution. That is to say, query generation application 112 groups words together according to similar calculated aggregate weights. For a group of 10,000 words, the words may be split into 20 groups of 500 words each. The 500 words with the highest aggregate weights are given rank 1, the 500 words with the next highest weights are given rank 2, and so on.
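By way of non-limiting illustration, the two ranking schemes described above may be sketched as follows. The sketch assumes a hypothetical mapping from each word to a row of weights, averages each row to obtain the aggregate weight, and then assigns either ranks 1 through n or grouped ranks. The function names and sample values are assumptions made for this sketch.

```python
def rank_words(weights):
    """Assign ranks 1..n; rank 1 goes to the word with the largest aggregate weight."""
    ordered = sorted(weights, key=weights.get, reverse=True)
    return {word: position for position, word in enumerate(ordered, start=1)}


def rank_words_in_groups(weights, n_groups):
    """Split the ordered words into n_groups equal-sized groups sharing one rank each."""
    ordered = sorted(weights, key=weights.get, reverse=True)
    group_size = -(-len(ordered) // n_groups)  # ceiling division
    return {word: index // group_size + 1 for index, word in enumerate(ordered)}


# Hypothetical per-word weight rows, averaged into aggregate weights.
rows = {
    "drill": [1.0, 0.8],
    "screw": [0.5, 0.3],
    "board": [0.6, 0.8],
    "hole": [0.1, 0.1],
}
weights = {word: sum(row) / len(row) for word, row in rows.items()}
ranks = rank_words(weights)
grouped = rank_words_in_groups(weights, n_groups=2)
```

With 10,000 words and n_groups=20, the grouped variant would assign rank 1 to the 500 words with the highest aggregate weights, as in the example above.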

Determining a word rank may further include determining a word type for each word. For example, query generation application 112 may assign word types according to what part of speech a word is, what topics the word is associated with, or any other criteria that may be used to differentiate between two terms. In such embodiments, word rank may be assigned globally (wherein for a group of 500 words, the words are ranked 1-500 regardless of the word types), or according to word type (wherein for a group of 500 words with 250 verbs and 250 nouns, each verb is ranked 1-250 and each noun is ranked 1-250).
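By way of non-limiting illustration, ranking words separately within each word type may be sketched as follows. The function name rank_within_types, the part-of-speech labels, and the sample weights are hypothetical.

```python
from collections import defaultdict


def rank_within_types(weights, word_types):
    """Rank words 1..m inside each word type, highest aggregate weight first."""
    by_type = defaultdict(list)
    for word in weights:
        by_type[word_types[word]].append(word)
    ranks = {}
    for members in by_type.values():
        members.sort(key=weights.get, reverse=True)
        for position, word in enumerate(members, start=1):
            ranks[word] = position
    return ranks


# Hypothetical aggregate weights and word types.
weights = {"drill": 0.9, "cut": 0.8, "board": 0.7, "rotate": 0.3}
word_types = {"drill": "noun", "cut": "verb", "board": "noun", "rotate": "verb"}
type_ranks = rank_within_types(weights, word_types)
```

Each word type restarts at rank 1, so the top noun and the top verb can both be selected when a query template has one slot per type.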

FIG. 5 depicts a block diagram of components of computer 500 in accordance with an illustrative embodiment of the present invention. It should be appreciated that FIG. 5 provides only an illustration of one implementation and does not imply any limitations with regard to the environments in which different embodiments may be implemented. Many modifications to the depicted environment may be made.

As depicted, the computer 500 includes communications fabric 502, which provides communications between computer processor(s) 504, memory 506, persistent storage 508, communications unit 512, and input/output (I/O) interface(s) 514. Communications fabric 502 can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system. For example, communications fabric 502 can be implemented with one or more buses.

Memory 506 and persistent storage 508 are computer-readable storage media. In this embodiment, memory 506 includes random access memory (RAM) 516 and cache memory 518. In general, memory 506 can include any suitable volatile or non-volatile computer-readable storage media.

One or more programs may be stored in persistent storage 508 for access and/or execution by one or more of the respective computer processors 504 via one or more memories of memory 506. In this embodiment, persistent storage 508 includes a magnetic hard disk drive. Alternatively, or in addition to a magnetic hard disk drive, persistent storage 508 can include a solid state hard drive, a semiconductor storage device, read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, or any other computer-readable storage media that is capable of storing program instructions or digital information.

The media used by persistent storage 508 may also be removable. For example, a removable hard drive may be used for persistent storage 508. Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage 508.

Communications unit 512, in these examples, provides for communications with other data processing systems or devices. In these examples, communications unit 512 includes one or more network interface cards. Communications unit 512 may provide communications through the use of either or both physical and wireless communications links.

I/O interface(s) 514 allows for input and output of data with other devices that may be connected to computer 500. For example, I/O interface 514 may provide a connection to external devices 520 such as a keyboard, keypad, a touch screen, and/or some other suitable input device. External devices 520 can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards. Software and data used to practice embodiments of the present invention can be stored on such portable computer-readable storage media and can be loaded onto persistent storage 508 via I/O interface(s) 514. I/O interface(s) 514 also connect to a display 522.

Display 522 provides a mechanism to display data to a user and may be, for example, a computer monitor.

The programs described herein are identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The terminology used herein was chosen to best explain the principles of the embodiment, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

1. A computer implemented method for generating natural language processing queries, the method comprising:
receiving one or more documents, wherein each document comprises a set of words;
receiving a set of feature information, wherein the set of feature information indicates one or more features, and wherein the indicated features are the features that are analyzed within each of the documents;
processing the one or more documents and the sets of words to provide a document content matrix V corresponding to the one or more documents and the sets of words, a word feature matrix W corresponding to a set of selected features and the sets of words, and a document feature matrix H corresponding to the one or more documents and the set of selected features;
using stochastic gradient descent to calculate a feature value for each feature for the sets of words, wherein the feature value is the appropriate word feature matrix entry;
forecasting values for each entry of the word feature matrix and the document feature matrix over a selected time interval and a selected set of domains to provide a forecasted word feature matrix W′ and a forecasted document feature matrix H′ by calculating predicted values for each entry according to trends displayed over time for each word feature and trends displayed in the selected set of domains;
calculating a set of coefficients for forecasted document feature matrix H′ such that V=W′*H′;
determining a rank for each word of the sets of words according to the calculated set of coefficients by calculating an average of the coefficients for each word, and sorting the words by average; and
using a query template to generate one or more queries according to the determined ranks for each word of the set of words and according to one or more word types.
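By way of illustration only, three of the recited steps (the stochastic gradient descent factorization, the ranking of words by average coefficient, and the template-based query generation) might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the learning rate, the squared-error objective, the layout of V as words by documents, and the example template are all hypothetical, and the forecasting and word-type steps are not modelled.

```python
import random


def factorize_sgd(V, n_features, steps=5000, lr=0.01, seed=0):
    """Approximate V (words x documents) by W (words x features) times
    H (features x documents), using SGD on the squared error of one
    randomly sampled entry per step. All hyperparameters are assumptions."""
    rng = random.Random(seed)
    n_words, n_docs = len(V), len(V[0])
    W = [[rng.random() for _ in range(n_features)] for _ in range(n_words)]
    H = [[rng.random() for _ in range(n_docs)] for _ in range(n_features)]
    for _ in range(steps):
        i, j = rng.randrange(n_words), rng.randrange(n_docs)
        pred = sum(W[i][k] * H[k][j] for k in range(n_features))
        err = V[i][j] - pred
        for k in range(n_features):
            w, h = W[i][k], H[k][j]
            W[i][k] += lr * err * h  # gradient step on the word feature entry
            H[k][j] += lr * err * w  # gradient step on the document feature entry
    return W, H


def squared_error(V, W, H):
    """Sum of squared differences between V and the product W * H."""
    total = 0.0
    for i, row in enumerate(V):
        for j, value in enumerate(row):
            pred = sum(W[i][k] * H[k][j] for k in range(len(H)))
            total += (value - pred) ** 2
    return total


def rank_words(words, coefficients):
    """Average each word's coefficients and sort words by that average,
    highest first. `coefficients` holds one row of numbers per word."""
    averages = {w: sum(row) / len(row) for w, row in zip(words, coefficients)}
    return sorted(words, key=lambda w: averages[w], reverse=True)


def build_query(ranked_words, template="What is the relationship between {0} and {1}?"):
    """Fill a (hypothetical) query template with the two top-ranked words."""
    return template.format(*ranked_words[:2])
```

For instance, given three words with coefficient rows `[0.1, 0.3]`, `[0.9, 0.7]`, and `[0.4, 0.6]`, the averages are 0.2, 0.8, and 0.5, so the second word ranks first and the template is filled with the second and third words. The default template needs at least two ranked words.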